Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent.

Authors

  • Sebastien Roch
  • Mike Steel
Abstract

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multispecies coalescent with a standard site substitution model, maximum likelihood estimation on sequence data concatenated across genes, performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree, is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
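As a rough illustration of the first hurdle (gene trees differing from the species tree through incomplete lineage sorting), the sketch below evaluates the standard three-taxon result under the multispecies coalescent: for a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units, a gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t). This is not the construction used in the paper; the function name `gene_tree_probs` and its return format are assumptions made for this example only.

```python
import math


def gene_tree_probs(t):
    """Gene tree topology probabilities for a rooted three-taxon species tree.

    t is the internal branch length in coalescent units (hypothetical
    parameter name). Returns the probability that a gene tree matches the
    species tree, and the probability of each of the two discordant trees.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The two lineages fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    discordant_each = math.exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant_each
    return {"concordant": concordant, "discordant_each": discordant_each}


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 5.0):
        probs = gene_tree_probs(t)
        print(t, probs["concordant"], probs["discordant_each"])
```

With a short internal branch, close to two thirds of gene trees disagree with the species tree, which is the regime in which treating all concatenated sites as if they evolved on one fixed tree is most clearly a misspecification.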


Related resources

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate forma...


Distances that perfectly mislead.

Given a collection of discrete characters (e.g., aligned DNA sites, gene adjacencies), a common measure of distance between taxa is the proportion of characters for which taxa have different character states. Tree reconstruction based on these (uncorrected) distances can be statistically inconsistent and can lead to trees different from those obtained using character-based methods such as maxim...


Point of View: Phylogenetic Analysis in the Anomaly Zone

The concatenation method has been widely used as a means of combining data to estimate phylogenetic trees (Huelsenbeck et al. 1996a, 1996b; Glazko and Nei 2003). However, simulation studies have shown that the maximum likelihood (ML) estimate of the species tree for concatenated sequences may be statistically inconsistent if the gene trees are highly heterogeneous (Kolaczkowski and Thornton 200...


Concatenation Analyses in the Presence of Incomplete Lineage Sorting - PLOS Currents Tree of Life

Incomplete lineage sorting (ILS), modelled by the multi-species coalescent, is a process that results in a gene tree being different from the species tree. Because ILS is expected to occur for at least some loci within genome-scale analyses, the evaluation of species tree estimation methods in the presence of ILS is of great interest. Performance on simulated and biological data have suggested ...


The Impact of Missing Data on Species Tree Estimation.

Phylogeneticists are increasingly assembling genome-scale data sets that include hundreds of genes to resolve their focal clades. Although these data sets commonly include a moderate to high amount of missing data, there remains no consensus on their impact to species tree estimation. Here, using several simulated and empirical data sets, we assess the effects of missing data on species tree es...



Journal title:
  • Theoretical Population Biology

Volume 100C  Issue 

Pages  -

Publication date 2014